
[MoE][Offload] Run MoE models exceeding VRAM via expert CPU offloading with GPU cache (--moe-expert-cache-size)#37190

Open
e1n00r wants to merge 3 commits into vllm-project:main from e1n00r:feature/moe-expert-lru-cache

Conversation

@e1n00r

@e1n00r e1n00r commented Mar 16, 2026

Purpose

CachedWeightProvider — MoE expert CPU offloading with GPU LFRU cache, addressing RFC #38256.

Expert weights live in CPU pinned memory; a fixed-size GPU cache holds the hottest N experts per layer using LFRU (frequency-weighted LRU) eviction. LFRU prevents early layers from monopolizing the cache — a known problem with pure LRU in sequential MoE execution. Models that exceed GPU VRAM can now run on smaller hardware.

No runner bypass — all paths go through quant_method.apply(). EP dispatch, DP chunking, and shared expert overlap work unchanged.

References: RFC #38256 | tinyserve (production validation, 481 tests)

Test results

Community validation (independent):

| Model | VRAM | tok/s | Tester |
| --- | --- | --- | --- |
| Nemotron-Cascade-2-30B-A3B (cache=8) | 7.6 GB | 15.6 | @caiovicentino |
| Gemma-4-26B-A4B-it (cache=8) | 8.6 GB | 14.8 | @caiovicentino |

LFRU vs LRU (Nemotron): LFRU at cache=8 achieves a higher hit rate than LRU at cache=16, with a +5.2% speed improvement.

Unit tests: 26 test cases (parametrized across dtypes, capacities, num_experts). Tests LFRU-specific eviction behavior (frequency-weighted, not just recency).

Changes

15 files, ~810 additions

| File | What |
| --- | --- |
| expert_weight_provider.py (new) | CachedWeightProvider with LFRU eviction, ExpertWeightResult dataclass |
| fused_moe_method_base.py | supports_expert_lru_cache property (default False) |
| fused_moe_modular_method.py | Provider check in apply() |
| layer.py | _maybe_init_expert_lru_cache(), expert_weight_provider attribute |
| unquantized_fused_moe_method.py | CPU weight allocation, cache init, kernel init for cache path, XPU transpose |
| quantization/fp8.py | supports_expert_lru_cache, provider check, cache init |
| offload.py | moe_expert_cache_size config field |
| vllm.py | Cross-validator: enforce_eager required |
| arg_utils.py | CLI argument --moe-expert-cache-size |
| llm.py | moe_expert_cache_size parameter in LLM.__init__ |
| basic_correctness.yaml | CI test area registration |
| docs/features/moe_cache_policies.md (new) | Feature documentation |
| test_expert_lru_cache.py (new) | 26 unit tests with parametrization |
| test_moe_expert_cache.py (new) | Integration test via compare_two_settings |
| benchmarks/qwen_122b_test_20260331.txt (new) | Benchmark raw data |

How it works

```text
moe_expert_cache_size == 0 (default):
  No provider created. Zero overhead (one getattr per layer).

moe_expert_cache_size > 0:
  CachedWeightProvider.prepare(topk_ids):
    for each unique expert:
      hit  → update LFRU frequency + recency (O(1))
      miss → evict lowest freq/age score, H2D copy, update mapping
    remap topk_ids → slot indices via persistent GPU mapping tensor
  → kernel receives GPU buffer + remapped IDs
```
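As a concrete illustration of the hit/miss path above, here is a self-contained Python sketch of frequency-weighted eviction. This is a toy model, not the PR's CachedWeightProvider; it uses the score = freq / (1 + age) rule described elsewhere in this thread, with the H2D weight copy reduced to a comment, and all names are hypothetical:

```python
class LFRUSketch:
    """Toy LFRU cache: evicts the resident expert with the lowest
    freq / (1 + age) score. Illustrative only; real code would also
    copy expert weights host-to-device on each miss."""

    def __init__(self, capacity: int, num_experts: int, decay: float = 0.9):
        self.capacity = capacity
        self.decay = decay
        self.clock = 0                       # advances once per prepare() call
        self.freq: dict[int, float] = {}     # expert_id -> decayed frequency
        self.last_used: dict[int, int] = {}  # expert_id -> clock of last access
        self.slot: dict[int, int] = {}       # expert_id -> GPU cache slot
        self.free = list(range(capacity))
        self.mapping = [0] * num_experts     # stands in for the GPU mapping tensor
        self.hits = self.misses = 0

    def _score(self, eid: int) -> float:
        age = self.clock - self.last_used[eid]
        return self.freq[eid] / (1 + age)

    def prepare(self, unique_ids: list[int]) -> list[int]:
        self.clock += 1
        for eid in unique_ids:
            # Decayed frequency: old accesses fade, recent ones dominate.
            self.freq[eid] = self.freq.get(eid, 0.0) * self.decay + 1.0
            self.last_used[eid] = self.clock
            if eid in self.slot:
                self.hits += 1
                continue
            self.misses += 1
            if self.free:
                s = self.free.pop()
            else:
                victim = min(self.slot, key=self._score)
                s = self.slot.pop(victim)
            # (real code: copy expert weights CPU -> GPU slot s here)
            self.slot[eid] = s
            self.mapping[eid] = s
        # Remap routed expert IDs to cache-slot indices.
        return [self.mapping[e] for e in unique_ids]
```

Under this scoring, a frequently reused expert survives eviction even when a less-used expert was touched more recently, which is the property that keeps deep-layer experts from being starved.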

Limitations

  • --enforce-eager required (CUDA graph compat deferred to PR 2)
  • Synchronous H2D copies (async pipeline in PR 2)
  • Single eviction policy (LFRU hardcoded, no pluggable framework)
  • EP > 1 not supported
  • BF16 + FP8 per-tensor only

Test plan

```bash
pytest tests/kernels/moe/test_expert_lru_cache.py -v
pytest tests/basic_correctness/test_moe_expert_cache.py -v -s
```

AI-assisted development (Claude Code). Architecture validated in tinyserve.

@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Only fastcheck CI runs, which covers a small, essential subset of tests to quickly catch errors.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

@mergify mergify bot added the frontend label Mar 16, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a dynamic LRU cache for MoE expert weights, a valuable feature for reducing GPU memory consumption. The implementation is well-structured, adding new configurations, a dedicated LRU cache class, and integrating it into the MoE layer. The new tests for correctness are also a great addition. My main feedback focuses on a performance issue within the LRU cache implementation itself, which could be optimized for better efficiency, especially with larger cache sizes.

Comment on lines +100 to +117
```python
for expert_id in unique_ids:
    if expert_id in self._expert_to_slot:
        self._lru_order.remove(expert_id)
        self._lru_order.append(expert_id)
        self.hits += 1
    else:
        if self._free_slots:
            slot = self._free_slots.pop()
        else:
            evicted = self._lru_order.pop(0)
            slot = self._expert_to_slot.pop(evicted)

        self._buf_w13[slot].copy_(self._cpu_w13[expert_id])
        self._buf_w2[slot].copy_(self._cpu_w2[expert_id])

        self._expert_to_slot[expert_id] = slot
        self._lru_order.append(expert_id)
        self.misses += 1
```
Contributor


high

The current LRU cache implementation uses a list for _lru_order, which results in O(N) complexity for remove() and pop(0) operations, where N is the cache capacity. This can become a performance bottleneck for larger cache sizes.

To improve performance to O(1) for these operations, I recommend refactoring the LRU logic to use collections.OrderedDict.

This would involve the following changes:

  1. In __init__, change _lru_order to an OrderedDict:

```python
from collections import OrderedDict

# ...
# LRU state (Python-only; must stay outside torch.compile).
self._expert_to_slot: dict[int, int] = {}
self._free_slots: list[int] = list(range(capacity))
# Front = least-recently-used expert ID.
self._lru_order: OrderedDict[int, None] = OrderedDict()
```

  2. Update the prepare method to use OrderedDict methods for efficient LRU management, as shown in the suggestion below.

Suggested change (before):

```python
for expert_id in unique_ids:
    if expert_id in self._expert_to_slot:
        self._lru_order.remove(expert_id)
        self._lru_order.append(expert_id)
        self.hits += 1
    else:
        if self._free_slots:
            slot = self._free_slots.pop()
        else:
            evicted = self._lru_order.pop(0)
            slot = self._expert_to_slot.pop(evicted)
        self._buf_w13[slot].copy_(self._cpu_w13[expert_id])
        self._buf_w2[slot].copy_(self._cpu_w2[expert_id])
        self._expert_to_slot[expert_id] = slot
        self._lru_order.append(expert_id)
        self.misses += 1
```

Suggested change (after):

```python
for expert_id in unique_ids:
    if expert_id in self._expert_to_slot:
        self._lru_order.move_to_end(expert_id)
        self.hits += 1
    else:
        if self._free_slots:
            slot = self._free_slots.pop()
        else:
            evicted, _ = self._lru_order.popitem(last=False)
            slot = self._expert_to_slot.pop(evicted)
        self._buf_w13[slot].copy_(self._cpu_w13[expert_id])
        self._buf_w2[slot].copy_(self._cpu_w2[expert_id])
        self._expert_to_slot[expert_id] = slot
        self._lru_order[expert_id] = None
        self.misses += 1
```

Author


Fixed in 8fc9268 — replaced list-based _lru_order with collections.OrderedDict. move_to_end() for hits and popitem(last=False) for eviction are both O(1).
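For reference, the two OrderedDict calls can be exercised standalone in a minimal model of just the recency bookkeeping (no slots or weight buffers; illustrative only):

```python
from collections import OrderedDict

# Minimal recency tracker mirroring the fixed _lru_order usage:
# move_to_end() on a hit, popitem(last=False) to evict the LRU entry.
lru: OrderedDict[int, None] = OrderedDict()

for expert_id in [3, 1, 4, 1, 5]:   # expert 1 is touched twice
    if expert_id in lru:
        lru.move_to_end(expert_id)  # O(1) hit path: refresh recency
    else:
        lru[expert_id] = None       # O(1) insert at the MRU end

evicted, _ = lru.popitem(last=False)  # O(1): pops the least-recently-used key
```

After the trace above, expert 3 (never re-touched) is the eviction victim, and the remaining order reflects true recency.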

@mergify mergify bot added the ci/build label Mar 16, 2026
Contributor

@alvinttang alvinttang left a comment


This is a well-designed feature — the LRU expert cache is a natural approach for running MoE models that exceed GPU memory. The implementation is clean and the code is well-documented. Here's a detailed review:

1. Thread safety concern in ExpertLRUCache.prepare()

The prepare() method mutates _expert_to_slot, _free_slots, and _lru_order without synchronization. In vLLM's current architecture, the forward pass is single-threaded on the model runner, so this is fine today. But if vLLM ever moves to concurrent forward passes (e.g., disaggregated prefill/decode with shared model weights), this would race. Worth a comment noting the single-threaded assumption.

2. Synchronous H2D copies in prepare() are a latency bottleneck

Each cache miss does a synchronous copy_() from CPU pinned memory to GPU. For large expert weights (e.g., DeepSeek-V2's 160 experts with ~7M params each), a miss could take 1-2ms per expert. If multiple misses occur in one forward pass (common with top-k=6 routing), this serialized copy could add 5-10ms per layer.

Consider using torch.cuda.Stream for async H2D copies with an event-based sync, or batching all misses into a single torch.cat + copy. The current approach is correct but may significantly impact throughput in practice.

3. The mapping tensor in prepare() is recreated every call

```python
mapping = torch.zeros(self._num_experts, dtype=torch.int64)
for expert_id, slot in self._expert_to_slot.items():
    mapping[expert_id] = slot
mapping = mapping.to(device=topk_ids.device)
```

This allocates a new CPU tensor, fills it with a Python loop, and transfers it to GPU on every forward pass. For a model with 160 experts and 60+ layers, this adds up. Consider keeping a persistent _mapping tensor on GPU and only updating the changed entries in-place.
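One possible shape for that persistent mapping, with a plain Python list standing in for the GPU tensor (a sketch of the idea, not the PR's code; all names are hypothetical):

```python
class SlotMapping:
    """Keeps the expert_id -> cache-slot mapping persistent across calls.

    A Python list stands in for the persistent GPU int tensor; in real
    code only the changed entry would be written in place (a scatter),
    avoiding a fresh allocation, Python fill loop, and full H2D transfer
    on every forward pass.
    """

    def __init__(self, num_experts: int):
        self._mapping = [0] * num_experts  # persistent "device" buffer

    def on_load(self, expert_id: int, slot: int) -> None:
        # Update only the entry that changed when an expert lands in a slot.
        self._mapping[expert_id] = slot

    def remap(self, topk_ids: list[int]) -> list[int]:
        # Gather: translate routed expert IDs to cache-slot indices.
        return [self._mapping[e] for e in topk_ids]
```

The gather in remap() is the only per-call work; the buffer itself is allocated once.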

4. _forward_with_expert_cache bypasses several runner features

The cache forward path calls fused_experts() directly, bypassing the normal runner's handling of:

  • w13_bias / w2_bias (MoE layers with bias)
  • Expert-parallel scatter/gather
  • Scale tensors for quantized weights (w13_weight_scale, w2_weight_scale)
  • Custom activation functions beyond self.activation

The EP and quantization incompatibilities are documented, but the bias case isn't mentioned. If any MoE model uses bias terms, this path would silently produce wrong results.

5. Missing enforce_eager validation

The docstring says --enforce-eager is required, but I don't see validation that rejects moe_expert_cache_size > 0 when enforce_eager=False. The @torch.compiler.disable decorator on _forward_with_expert_cache helps, but if CUDA graphs are used at a higher level, the dynamically changing buffer contents would cause correctness issues. Consider adding a config validator that errors out if moe_expert_cache_size > 0 and not enforce_eager.

6. Memory accounting

When expert weights are allocated on CPU pinned memory, vLLM's GPU memory profiler won't account for them. This means gpu_memory_utilization calculations will over-estimate available KV cache memory by the amount of expert weight memory that was moved to CPU. The profiler may need to be made aware of the CPU pinned allocation to avoid OOM during KV cache allocation.

7. Tests are good but limited

The correctness test (compare_two_settings) verifies output token matching, which is the most important thing. Consider also testing:

  • Cache hit/miss counters (to verify the LRU logic is working)
  • Edge case: cache_size >= num_experts (all experts fit, no eviction)
  • Edge case: cache_size = 1 (maximum eviction pressure)

Overall this is a solid first implementation of MoE expert offloading. The main production concerns are the synchronous H2D copy latency and the missing enforce_eager validation.

@e1n00r
Author

e1n00r commented Mar 16, 2026

Thanks for the thorough review @alvinttang! Addressing each point:

1. Thread safety — Added a comment in ExpertLRUCache noting the single-threaded assumption (68c81df). You're right that if vLLM ever supports concurrent forwards with shared weights this would need a lock.

2. Synchronous H2D copies — Agreed, this is the main latency bottleneck. Async H2D with double-buffered CUDA streams (the "DBO scheduling" from RFC #33869) is the top item in the planned PR 2. Mentioning it here so it's on record.

3. Persistent mapping tensor — Implemented in 68c81df. _mapping is now a persistent [num_experts] GPU int32 tensor, updated in-place at each miss. Eliminates the per-call CPU allocation + Python loop + H2D transfer from the hot path.

4. Bias bypass — Guard added in 68c81df: _maybe_init_expert_lru_cache() checks moe_config.has_bias and logs a warning + returns early, so the cache is disabled rather than producing wrong results. A follow-up PR can wire bias tensors through (they're small, so CPU-pinning them is trivial) when a bias-using MoE model needs offloading.

5. enforce_eager guard — In the code since 68c81df. From FusedMoE.__init__():

```python
if self._moe_expert_cache_size > 0 and (
    not vllm_config.model_config.enforce_eager
):
    logger.warning(
        "moe_expert_cache_size requires --enforce-eager; ..."
    )
    self._moe_expert_cache_size = 0
```

The cache is disabled outright (after the warning is logged) when enforce_eager=False, rather than running in an unsupported configuration.

6. Memory accounting — Valid concern. The GPU profiler won't see CPU-pinned allocations, so it will over-allocate KV cache against memory that expert weights no longer occupy. This is actually a benefit (more KV cache headroom), not a hazard — the expert weights are no longer on GPU. But you're right that if someone relies on gpu_memory_utilization for precise sizing, the accounting is off. I'll add a note to the PR description.

7. Tests — 16 unit tests in tests/kernels/moe/test_expert_lru_cache.py (618392a), covering: hit/miss counters, LRU eviction correctness, slot remapping, GPU buffer content post-eviction, dtype preservation, CPU pinned backing store, FP8 per-slot scale buffering, and the no-scales path. Edge case capacity >= num_experts (no eviction pressure) is implicitly covered by _free_slots never emptying in those scenarios.

@e1n00r e1n00r force-pushed the feature/moe-expert-lru-cache branch 5 times, most recently from 4db08e9 to 618392a Compare March 16, 2026 21:22
@e1n00r e1n00r marked this pull request as ready for review March 17, 2026 07:38
@mergify
Contributor

mergify bot commented Mar 17, 2026

Hi @e1n00r, the pre-commit checks have failed. Please run:

```bash
uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
```bash
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
```

@e1n00r e1n00r force-pushed the feature/moe-expert-lru-cache branch 2 times, most recently from 29afd27 to 6af6bba Compare March 17, 2026 10:56
@mergify
Contributor

mergify bot commented Mar 17, 2026

Hi @e1n00r, the pre-commit checks have failed. Please run:

```bash
uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
```bash
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
```

@vlascik

vlascik commented Mar 17, 2026

Also check this paper: https://arxiv.org/html/2410.17954v1

Instead of LRU, they load with a predictor:

"ExpertFlow consists of three key components: the Routing Path Predictor, the Expert Cache Engine, and the Token Scheduler.

Leveraging the three synergistic components of our system, ExpertFlow achieves an average GPU memory savings of 75.4%, with peak savings reaching up to 93.72%, compared to GPU-only solutions. Furthermore, ExpertFlow attains an expert cache hit ratio of up to 91.96%, improving the hit ratio by an average of 27.65% over the LRU caching strategy. Additionally, ExpertFlow delivers a 2 to 10 times increase in inference speed."

@e1n00r
Author

e1n00r commented Mar 17, 2026

> Also check this paper: https://arxiv.org/html/2410.17954v1
>
> Instead of LRU, they load with a predictor:
>
> "ExpertFlow consists of three key components: the Routing Path Predictor, the Expert Cache Engine, and the Token Scheduler.
>
> Leveraging the three synergistic components of our system, ExpertFlow achieves an average GPU memory savings of 75.4%, with peak savings reaching up to 93.72%, compared to GPU-only solutions. Furthermore, ExpertFlow attains an expert cache hit ratio of up to 91.96%, improving the hit ratio by an average of 27.65% over the LRU caching strategy. Additionally, ExpertFlow delivers a 2 to 10 times increase in inference speed."

If I did that, we'd just be remaking PowerInfer, which is a well-established solution in its own right.
It would also require training predictor models (just as PowerInfer does).
If I were the target audience, I would just use that backend instead.
The point of vLLM, for me at least, is its scalability and wide support; adding this requirement would make the feature nigh useless.
Perhaps a middle ground? Something that learns on the fly?
In any case, I would push anything of that magnitude to another PR.

@vlascik

vlascik commented Mar 17, 2026

Well, apparently there's quite a few options here:

  1. Frequency–recency hybrid (statistical scoring)
    A. Least Cache Priority (LCP) / exponential decay scoring. Combines frequency (μ) and recency gap (ν) into a single score
    B. ARC (Adaptive Replacement Cache)
    C. LFU

  2. Reuse-distance / stack-distance models
    D. LIRS (Low Inter-reference Recency Set)
    E. Reuse-distance–based admission control (general approach)

  3. Structure-aware (MoE-specific statistical policies)
    F. Layered-LRU (LLRU)
    G. Miss-rate–constrained caching (global statistical control)

  4. Partial / fractional caching (statistical resource allocation)
    H. Bit-sliced / fractional expert caching (DBSC)

  5. Admission-control + statistical filtering
    I. Probabilistic admission (TinyLFU-style ideas applied to MoE)

  6. Hybrid statistical + constraint-based approaches
    J. Multi-tier statistical caching

But these are not better than Predictor-based systems (e.g., ProMoE, ExpertFlow) and Learned replacement (e.g., FlashMoE ML policy).

Strong non-ML alternatives:
Best general: ARC, LIRS
Best MoE-specific: LLRU, LCP
Most novel: bit-sliced / fractional caching
Most promising direction (non-ML): Score-based caching

Of course, that's all for another PR; it's important to at least get this caching-strategy ball rolling, since the possible speedups seem massive. Maybe it would be nice to make the strategy pluggable?
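A pluggable strategy could be as small as a two-method interface. The sketch below is hypothetical (not from this PR) and shows plain LRU as one policy implementation behind such an interface:

```python
from typing import Protocol


class EvictionPolicy(Protocol):
    """Minimal interface a pluggable cache policy might expose: observe
    accesses, and pick a victim among the currently resident experts."""

    def on_access(self, expert_id: int, step: int) -> None: ...
    def choose_victim(self, resident: set[int], step: int) -> int: ...


class LRUPolicy:
    """Plain recency-based policy satisfying the protocol; ARC, LIRS,
    or an LFRU score could be dropped in behind the same two methods."""

    def __init__(self) -> None:
        self.last_used: dict[int, int] = {}

    def on_access(self, expert_id: int, step: int) -> None:
        self.last_used[expert_id] = step

    def choose_victim(self, resident: set[int], step: int) -> int:
        # Evict the resident expert with the oldest access timestamp.
        return min(resident, key=lambda e: self.last_used.get(e, -1))
```

The cache engine would then hold an EvictionPolicy instance and never need to know which scoring rule is in use.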

caiovicentino pushed a commit to caiovicentino/vllm-expert-offload that referenced this pull request Apr 3, 2026
Standard LRU lets early layers monopolize the cache because they execute
first every forward pass. LFRU tracks per-expert access frequency
(decayed) and evicts the expert with lowest score = freq / (1 + recency).

On GPT-OSS-20B: deep-layer hit rate improved from 0-8% to 52-94%.
Critical for models with 128 experts/layer (Gemma 4, Nemotron).

LFRUCachedWeightProvider is a drop-in replacement for CachedWeightProvider.

Ref: vllm-project#37190 (e1n00r LFRU findings)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@caiovicentino

Gemma 4 26B-A4B-it validation (128 experts × top-8, 30 layers)

Tested the rebased fork on google/gemma-4-26B-A4B-it — a very different MoE architecture from Nemotron:

| | Nemotron 30B-A3B | Gemma 4 26B-A4B |
| --- | --- | --- |
| Experts/layer | 128 (top-6) | 128 (top-8) |
| Layers | 52 (hybrid Mamba+MoE) | 30 (hybrid attention) |
| Attention | 6 attention layers | Sliding window + global |
| head_dim | 128 | 256 (heterogeneous, global=512) |

Results (cache_size=8, RTX PRO 6000 Blackwell)

| Metric | Value |
| --- | --- |
| Speed | 14.8 tok/s |
| Model VRAM | 8.62 GB |
| Load time | 212 s |
| Generation | ✅ Correct (math, science, code) |

```
Expert LRU cache enabled for language_model.model.layers.0.moe.experts: 8/128 experts cached on GPU.
...
Expert LRU cache enabled for language_model.model.layers.29.moe.experts: 8/128 experts cached on GPU.
Model loading took 8.62 GiB memory and 211.945657 seconds
```

Required fixes for Colab/Jupyter

Had to add a try/except in vllm/utils/system_utils.py:suppress_stdout() because sys.stdout.fileno() raises io.UnsupportedOperation in ipykernel subprocesses (EngineCore runs as a separate process that inherits Jupyter's stdout). Fix: commit dab55b3.

Also quantized with PolarQuant Q5

Quantized all 7,680 MoE expert weights (3D nn.Parameter tensors) + 427 nn.Linear layers with PolarQuant Q5, saving codes to HuggingFace. Download reduced from 51.6 GB → 26.9 GB.

Note on LFRU

Looking forward to testing the LFRU policy on this model. With 128 experts × 30 layers and top-8 routing, the deep-layer cache starvation you described for GPT-OSS-20B should be even more pronounced here. I've implemented an initial LFRU version in the fork (commit 7a19e4d) — happy to coordinate on this.

@e1n00r
Author

e1n00r commented Apr 3, 2026

@caiovicentino — the LFRU results confirm what we measured independently. We have LFRU and 6 other policies implemented and benchmarked in tinyserve.

Based on your validation (LFRU cache=8 exceeds LRU cache=16 in hit rate) and our own benchmarks (deep-layer starvation eliminated, +5-50% throughput), we've updated the PR to ship LFRU as the built-in eviction policy instead of LRU. No config field, no pluggable framework — just the better algorithm baked in. Keeps the PR focused.

The Gemma 4 validation (14.8 tok/s, 8.62 GB) is strong — two models on two architectures with correct output.

For the rebase: we've rebased onto current main using your conflict resolutions (co-authored). One clean commit, 14 files, no workflow changes.

For experimentation: if you want to try ideas freely — buddy substitution, CPU-on-miss, dynamic VRAM rebalancing, imatrix cache seeding, per-layer cache budgets — tinyserve is the playground. It's ~7K LOC Python with 340 tests, all of these features implemented and benchmarked. Alternatively, we could set up a shared experimental branch on the vLLM fork for testing concepts before they go into a PR. Either way works — happy to coordinate.

@e1n00r e1n00r force-pushed the feature/moe-expert-lru-cache branch 2 times, most recently from 73a7584 to f3e6781 Compare April 3, 2026 14:19
@mergify mergify bot removed the needs-rebase label Apr 3, 2026
@e1n00r e1n00r force-pushed the feature/moe-expert-lru-cache branch 2 times, most recently from e07f615 to 71ed1fc Compare April 3, 2026 14:41
@mergify mergify bot added the performance Performance-related issues label Apr 3, 2026
@e1n00r e1n00r force-pushed the feature/moe-expert-lru-cache branch from 71ed1fc to 73f9f89 Compare April 3, 2026 15:01
@e1n00r
Author

e1n00r commented Apr 7, 2026

@mgoin — could you add the ready label to trigger CI? All pre-commit checks are now passing (mypy clean, ruff formatted). Two independent validations in the comments: Nemotron-Cascade-2-30B-A3B (7.6 GB VRAM, 15+ tok/s) and Gemma-4-26B-A4B-it, both with correct output.

@mergify
Contributor

mergify bot commented Apr 7, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @e1n00r.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Apr 7, 2026
caiovicentino pushed a commit to caiovicentino/vllm-expert-offload that referenced this pull request Apr 7, 2026
… Nemotron fixes

Cherry-picked and resolved conflicts for all 7 commits from
e1n00r/vllm@feature/moe-expert-lru-cache onto vllm-project/vllm main.

Resolved conflicts in:
- layer.py: merged expert cache init with current __init__ structure
- unquantized_fused_moe_method.py: merged provider check with current API
- fp8.py: added cache init call
- offload.py: added moe_expert_cache_size field

Tested on Nemotron-Cascade-2-30B-A3B (RTX PRO 6000 Blackwell):
- cache=8: 15.6 tok/s, correct output
- cache=16: 19.6 tok/s
- cache=32: 24.4 tok/s

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
caiovicentino pushed a commit to caiovicentino/vllm-expert-offload that referenced this pull request Apr 7, 2026
Standard LRU lets early layers monopolize the cache because they execute
first every forward pass. LFRU tracks per-expert access frequency
(decayed) and evicts the expert with lowest score = freq / (1 + recency).

On GPT-OSS-20B: deep-layer hit rate improved from 0-8% to 52-94%.
Critical for models with 128 experts/layer (Gemma 4, Nemotron).

LFRUCachedWeightProvider is a drop-in replacement for CachedWeightProvider.

Ref: vllm-project#37190 (e1n00r LFRU findings)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@caiovicentino

@e1n00r — We rebased the expert offloading code on top of the latest upstream/main (70406eb, April 7th) and confirmed it works end-to-end with a CompressedTensors INT4 MoE model.

Test results (Qwopus-MoE-35B-A3B INT4 CT, RTX PRO 6000 Blackwell 102 GB):

| Config | tok/s | VRAM |
| --- | --- | --- |
| All-in-GPU (vLLM 0.19 native) | 23.6 | ~25 GB |
| --moe-expert-cache-size 8 | 37.4 | ~8 GB |
| BF16 original | 16.2 | ~72 GB |

The LFRU cache is 1.58x faster than all-in-GPU — better memory locality with 8 hot experts vs 256 scattered. Coherent output confirmed (not garbage).

Our rebased fork: https://github.com/caiovicentino/vllm-expert-offload (branch main, commit f324bd9)

The rebase resolved one conflict in unquantized_fused_moe_method.py (new XPU branch added upstream). All 9 commits apply cleanly. Happy to help with the rebase on your side — you can cherry-pick from our branch if useful.

Co-authored-by: Caio Vicentino caiovicentino@users.noreply.github.com
Co-authored-by: Claude noreply@anthropic.com

@caiovicentino

Benchmark charts for the INT4 MoE test above:


Model card with full details: https://huggingface.co/caiovicentino1/Qwopus-MoE-35B-A3B-PolarQuant-Q5

PPL 6.56 on full WikiText-2 (295K tokens) — virtually identical to BF16 baseline. The LFRU expert cache with only 8 hot experts is faster than loading all 256 into GPU.

@tkj666

tkj666 commented Apr 9, 2026

Hi @e1n00r @caiovicentino! Thank you for your excellent work and test results. I have a small problem, though: when the cache overflows during prefill, the current behaviour is to keep the suffix of the required expert list, i.e. the experts with the greatest IDs, since the list is sorted by unique(). I doubt whether the experts kept this way are "most likely to be needed in upcoming decode steps", and I wonder how dropping the other experts' results will impact model performance.

e1n00r and others added 2 commits April 9, 2026 16:39
…ia expert CPU offloading

Expert weights live in CPU pinned memory; a GPU cache holds the
hottest N experts per layer using LFRU (frequency-weighted LRU)
eviction. LFRU prevents early layers from monopolizing the cache —
a known problem with pure LRU in sequential MoE execution.

On cache hit: zero-copy GPU forward from fixed-address buffer.
On miss: synchronous H2D copy, expert remapped to cache slot.

CLI: --moe-expert-cache-size N --enforce-eager
Config: OffloadConfig.moe_expert_cache_size

Tested on OLMoE-1B-7B (8 GB GPU), Nemotron-Cascade-2-30B-A3B
(7.6 GB VRAM, 15+ tok/s), and Gemma-4-26B-A4B-it (8.6 GB VRAM,
14.8 tok/s). LFRU validated independently on Nemotron: cache=8
LFRU exceeds cache=16 LRU in hit rate. RFC: vllm-project#38256

Signed-off-by: Elnur Abdullaev <elnur.abdullaev@sonia.so>
Co-authored-by: Caio Vicentino <caiovicentino@Mac.lan>
Co-authored-by: Claude <noreply@anthropic.com>
- expert_weight_provider.py: assert best_key is not None before dict.pop()
  (best_key is set by the loop which runs when _lru is non-empty)
- unquantized_fused_moe_method.py: assert experts_cls is not None before
  make_unquantized_moe_kernel() in cache-active path (mirrors line 193)
- ruff format: layer.py, unquantized_fused_moe_method.py, test_expert_lru_cache.py

Signed-off-by: Elnur Abdullaev <elnur.abdullaev@sonia.so>
@e1n00r e1n00r force-pushed the feature/moe-expert-lru-cache branch from 30bdcda to 1844fbb Compare April 9, 2026 14:40
…ent truncation

Prefill batches can activate more unique experts than gpu_capacity.
The previous code kept the highest-ID experts (by unique() sort order),
which is arbitrary and produces silently incorrect outputs: tokens routed
to dropped experts compute with stale slot weights from a different expert.

Replace with a hard RuntimeError pointing users to --moe-expert-cache-size.
Chunked-prefill-aware loading (sub-batch within capacity) is the correct
long-term fix and will come in a follow-up PR.

Fixes concern raised by @tkj666 in PR review.

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Elnur Abdullaev <elnur.abdullaev@sonia.so>
@e1n00r e1n00r force-pushed the feature/moe-expert-lru-cache branch from 1844fbb to 1a8df90 Compare April 9, 2026 14:41
@mergify mergify bot removed the needs-rebase label Apr 9, 2026
@caiovicentino

@tkj666 great catch — this is a real limitation and worth being explicit about.

You're correct: torch.unique() returns the tensor sorted, so slicing the suffix keeps the highest expert IDs, which has no relationship to access order. It's arbitrary eviction during prefill.

Why it ended up this way

During prefill there's no LRU history yet — no prior tokens have touched the experts, so we have zero signal about which to keep. We defaulted to "keep the suffix of unique()" as a placeholder without claiming it's optimal. The LRU signal only starts accumulating once decode begins.

The real issue

For short prefills (len(unique_experts) ≤ cache_size) this never matters — everything fits. For long prefills where unique experts exceed cache size:

  1. Prefill thrashes, forced to reload overflow experts repeatedly
  2. The experts that survive are arbitrary high-ID ones
  3. Decode starts with a cache populated by essentially random experts
  4. LRU gradually corrects to real decode patterns

Better policy: prefill-access-order LRU

The right fix is tracking expert access order during prefill in a small ring buffer and using that as the initial LRU state when decode begins. Turns "arbitrary prefill state" into "actual recency from prefill traffic" at near-zero cost.
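The ring-buffer seeding sketched above could look like this in plain Python (illustrative only; prefill_trace and the trace values are hypothetical):

```python
from collections import OrderedDict, deque

# Record expert access order during prefill in a small ring buffer
# (bounded, so a very long prefill can't grow it without limit).
prefill_trace: deque[int] = deque(maxlen=64)
for expert_id in [7, 3, 7, 9, 3, 1]:  # hypothetical prefill routing trace
    prefill_trace.append(expert_id)

# When decode begins, seed the LRU order from actual prefill recency:
# replaying the trace leaves each expert at its most recent position.
lru: OrderedDict[int, None] = OrderedDict()
for expert_id in prefill_trace:
    if expert_id in lru:
        lru.move_to_end(expert_id)
    else:
        lru[expert_id] = None

# The least-recently-used expert after prefill is the first eviction victim.
victim, _ = lru.popitem(last=False)
```

Decode then starts from real recency information instead of an arbitrary suffix of sorted IDs.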

Benchmark plan

We want to quantify the quality impact before picking a policy. We're planning to compare on a long-prompt workload with moe_expert_cache_size=4:

| Policy | Description |
| --- | --- |
| current | suffix of sorted unique |
| random | random eviction during prefill |
| prefix | keep lowest IDs (symmetric check) |
| prefill-LRU | track prefill access order, seed decode LRU |
Thanks for flagging this — it matters for long-context / small-cache workloads exactly like you describe. Will follow up on this thread when we have numbers.

@e1n00r
Author

e1n00r commented Apr 10, 2026

@tkj666 @caiovicentino — thanks both. The fix is now in 1a8df90.

Digging into it we found the suffix-of-unique() behavior is actually a correctness bug, not just a quality-of-eviction issue:

_mapping is initialized as torch.zeros(num_experts, dtype=int32) and is only updated for experts that are loaded into a slot. On overflow, the dropped experts keep whatever value _mapping had — slot 0 on a cold cache, or a previously-evicted expert's old slot on a warm cache. The kernel then reads _mapping[dropped_id], indexes into the GPU buffer, and computes with a different expert's weights for every token routed to a dropped expert. The tokens still get a real matmul — just with the wrong weights, and the wrong activations propagate downstream.
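A toy illustration of the hazard, using plain tensors rather than the actual provider code (all names here are stand-ins):

```python
import torch

num_experts, cache_slots, hidden = 4, 2, 8
# Distinct per-expert weights so wrong-slot reads are visible.
cpu_weights = torch.stack(
    [torch.full((hidden,), float(e)) for e in range(num_experts)])
gpu_cache = torch.zeros(cache_slots, hidden)

# The mapping starts all-zero: every expert "points" at slot 0.
mapping = torch.zeros(num_experts, dtype=torch.int32)

# Load experts 0 and 1 into the two slots.
for slot, eid in enumerate([0, 1]):
    gpu_cache[slot] = cpu_weights[eid]
    mapping[eid] = slot

# Expert 3 was never loaded (it "overflowed"), but mapping[3] is still 0,
# so a lookup silently returns expert 0's weights instead of failing.
w = gpu_cache[mapping[3]]
print(torch.equal(w, cpu_weights[0]))  # wrong weights, no error raised
```

Every token routed to expert 3 would compute with expert 0's weights, which is exactly the silent-corruption mode described above.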

Given that, any truncation policy (suffix / prefix / random / prefill-access-order) has the same defect unless we also fence off the dropped experts in _mapping. For PR1 we went with fail-fast:

if len(unique_ids) > self.capacity:
    raise RuntimeError(
        f"CachedWeightProvider: {len(unique_ids)} unique experts requested "
        f"but --moe-expert-cache-size={self.capacity}. "
        f"Set --moe-expert-cache-size >= {len(unique_ids)}."
    )

The prefill-LRU seeding @caiovicentino sketched is the right direction for a follow-up PR alongside chunked-prefill-aware loading — once the kernel has a way to skip tokens whose expert is not resident, the policy question becomes meaningful. Until then, "fit all unique experts in capacity, or error" is the only correct option.

Practical guidance if you hit this: either set --moe-expert-cache-size >= num_experts (all experts GPU-resident, so eviction never triggers at steady state, and the cache still saves the CPU→GPU load latency on cold starts), or reduce --max-num-prefill-tokens so a single chunk touches at most cache-capacity unique experts.

@tkj666

tkj666 commented Apr 10, 2026

@e1n00r — I think we can make prepare() return a generator over sets of at most capacity experts, and call the kernel on each set before combining the results during prefill. I assume the kernel supports masking out a subset of the experts, as in the EP case, though I haven't really looked into it. The procedure above is effectively EP with each replica executing one after another on the same GPU.

@e1n00r
Author

e1n00r commented Apr 10, 2026

Empirical validation

Tested "partition experts into N groups, call fused_experts once per group with expert_map[id]=-1 for out-of-group, accumulate partials" on an RTX PRO 2000 Blackwell. Results across shapes from (M=4, E=8, top-2) through (M=1024, E=32, top-4) and 2/4/8/16 groups: bitwise equal to single-call baseline at smaller shapes; at larger shapes the difference is pure dtype rounding from non-associative bf16 summation — measured ~6.5e-3 rel err for bf16, ~8e-4 for fp16, ~1e-7 for fp32, each exactly matching the mantissa floor. The slot-space decomposition that CachedWeightProvider would actually use (sliced w1[:cap]/w2[:cap] per group with remapped slot ids) is also bitwise equal to the baseline.
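The partition-and-accumulate equivalence can be checked with a dense reference MoE in place of fused_experts; this sketch masks out-of-group experts the way expert_map[id] = -1 would (moe_ref is a stand-in, not the vLLM kernel):

```python
import torch

torch.manual_seed(0)
M, K, E, topk, groups = 16, 32, 8, 2, 4
x = torch.randn(M, K)
w = torch.randn(E, K, K)  # one square "expert" weight per id, for simplicity
gates = torch.softmax(torch.randn(M, E), dim=-1)
topk_w, topk_ids = torch.topk(gates, topk)

def moe_ref(x, w, topk_w, topk_ids, active=None):
    # Dense reference: weighted sum over each token's top-k experts,
    # optionally restricted to an `active` id set (stand-in for
    # expert_map[id] != -1 masking).
    out = torch.zeros_like(x)
    for t in range(x.shape[0]):
        for j in range(topk_ids.shape[1]):
            e = int(topk_ids[t, j])
            if active is not None and e not in active:
                continue
            out[t] += topk_w[t, j] * (x[t] @ w[e])
    return out

baseline = moe_ref(x, w, topk_w, topk_ids)

# Mini-EP style: one call per expert group, accumulate partial outputs.
acc = torch.zeros_like(x)
per_group = E // groups
for g in range(groups):
    active = set(range(g * per_group, (g + 1) * per_group))
    acc += moe_ref(x, w, topk_w, topk_ids, active=active)

print(torch.allclose(baseline, acc, atol=1e-5))  # True
```

Each (token, expert) term lands in exactly one group, so the partials sum to the baseline up to floating-point reordering.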

The constraint: use fused_experts_impl, not the modular kernel

The provider currently routes through TritonExperts.apply (modular kernel). That path is incompatible with cross-call output accumulation because of modular_kernel.py:1080-1091: output and workspace13 are both views into common_workspace, and WorkspaceManager in v1/worker/workspace.py keeps one persistent buffer per ubatch. Call N+1's activation writes intermediate_cache2 (which aliases output storage via workspace13), clobbering call N's output. A loop with output += partial_g on the modular path silently corrupts the accumulator.

The non-modular fused_experts_impl path is the correct target: it allocates cache13/cache2/out_hidden_states fresh per call with no shared workspace, already calls intermediate_cache3.zero_() when expert_map is not None (fused_moe.py:1862-1863), and uses ignore_invalid_experts=True in moe_align_block_size to drop -1-routed tokens at sort time rather than writing zeros through the kernel. Empirically verified: each fused_experts(..., inplace=False) call returns an independently-allocated tensor (distinct data_ptr()), so outer accumulation across calls is safe.

Sketch for forward_native:

if provider is not None:
    # prepare_chunked returns a list of chunks; a single chunk when
    # all unique experts fit in capacity (a generator would break len()).
    chunks = provider.prepare_chunked(topk_ids)
    if len(chunks) == 1:
        r = chunks[0]
        return fused_experts(x, r.w1, r.w2, topk_weights, r.topk_ids,
                             inplace=False, expert_map=r.slot_map)
    acc = torch.zeros_like(x)
    for r in chunks:
        acc += fused_experts(x, r.w1, r.w2, topk_weights, r.topk_ids,
                             inplace=False, expert_map=r.slot_map)
    return acc

Cost model: PCIe transfer dominates

Measured on the target hardware at M=512, K=2048, N=2048, E=32, top-4, 24 MB per bf16 expert (w13+w2):

| Configuration | Per layer | Bottleneck |
| --- | --- | --- |
| Persistent full-cache (capacity=32) | 6.5 ms | GEMM compute |
| Mini-EP 2/4/8 groups, cold full swap (cap=16/8/4) | ~100 ms | PCIe transfer; ~7 GB/s observed, ~3.5 ms/expert |
| Mini-EP 2 groups, warm cache (cap=28), 2 new experts | ~17 ms | PCIe transfer (partial) + extra kernel call |
| Mini-EP 2 groups, warm cache, 0 new experts | ~12 ms | extra kernel call overhead |

Kernel launch overhead is secondary: +6% for 2 groups, +15% for 4, +35% for 8 at M=1024 prefill (5-6 Triton launches per call × ~100 μs each). The 15x slowdown in the cold-swap case is entirely PCIe bandwidth — total bytes transferred are identical regardless of how you partition the groups.
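A back-of-envelope model reproduces the shape of those numbers from the measured constants (24 MB/expert, ~7 GB/s, ~100 us per launch, ~6 launches per call); the helper below is illustrative only, and lands within roughly 2x of the measured warm-cache figures:

```python
# Rough per-layer cost model from the benchmark constants above.
expert_mb = 24            # bf16 w13 + w2 per expert
pcie_gbps = 7             # observed effective H2D bandwidth
launch_us = 100           # per Triton launch
launches_per_call = 6

def layer_ms(new_experts: int, groups: int, gemm_ms: float = 6.5) -> float:
    transfer_ms = new_experts * expert_mb / pcie_gbps  # MB / (GB/s) -> ms
    launch_ms = groups * launches_per_call * launch_us / 1000
    return gemm_ms + transfer_ms + launch_ms

print(round(layer_ms(new_experts=32, groups=2), 1))  # cold full swap
print(round(layer_ms(new_experts=2, groups=2), 1))   # warm, 2 new experts
print(round(layer_ms(new_experts=0, groups=2), 1))   # warm, 0 new experts
```

The cold-swap estimate is dominated by the transfer term (32 experts at ~3.4 ms each), consistent with the ~15x slowdown being pure PCIe bandwidth.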

Practical consequence for the design: mini-EP is the correct fallback when VRAM cannot fit all experts, and --moe-expert-cache-size >= num_experts remains the recommended path whenever it fits. The overhead grows linearly with the number of new experts per layer, so the design should prefer a high cache hit rate over a clever grouping strategy.

Scope: PR2 alongside async H2D

Holding this out of PR1 for three reasons:

  1. PR2 already restructures prepare() and the forward loop for the async copy stream. Mini-EP and async prefetch share the same API surface (prepare_chunked() with optional pre-fetched slots), so designing them together avoids an architectural carve-out that a PR1 bolt-on would require.

  2. Routing mini-EP through fused_experts_impl means the provider path diverges from the modular kernel infrastructure for shared_experts / quant_config / prepare_finalize dispatch. That's a deliberate architectural choice deserving its own design discussion with @mgoin, rather than a late addition to a month-old PR.

  3. Two open questions require real-model runs:

    • Does try_get_optimal_moe_config cache-hit across groups when effective num_valid_tokens changes per group? A miss means autotune re-runs every group, destroying first-forward latency.
    • Does bf16 accumulation drift over 30-60 layers × hundreds of forward passes stay within atol=2e-2? Microbenchmark drift is at the mantissa floor; real-model drift under repeated reductions is untested.
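The drift question can be probed without a real model: accumulate per-group partials in bf16 against an fp32 reference over many simulated layers and track the running relative error (a microbenchmark sketch, not the proposed real-model test harness):

```python
import torch

torch.manual_seed(0)
M, K, layers, groups = 64, 2048, 48, 4

# Simulate `layers` rounds of "sum of per-group partial outputs" in bf16
# against a single fp32 reference reduction, tracking relative error.
max_rel_err = 0.0
for _ in range(layers):
    parts = [torch.randn(M, K) for _ in range(groups)]
    ref = sum(parts)                          # fp32 reference
    acc = torch.zeros(M, K, dtype=torch.bfloat16)
    for p in parts:
        acc += p.bfloat16()                   # chunked bf16 accumulation
    rel = (acc.float() - ref).norm() / ref.norm()
    max_rel_err = max(max_rel_err, rel.item())

print(f"max rel err over {layers} layers: {max_rel_err:.2e}")
```

In this synthetic setting the norm-relative error sits well below the 2e-2 bound; the open question is whether correlated activations over real layers behave the same way.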

Happy to prototype this on a follow-up branch after PR1 merges — your framing ("mini-EP with each replica executing one by one on the same GPU") is the right mental model, and the two non-obvious implementation points are (a) the non-modular routing required to avoid the workspace aliasing, and (b) hoisting unique() out of the per-group loop to pay the D2H sync once per forward rather than N times.

@caiovicentino

Thanks for the rigor on this — the workspace aliasing finding alone is the kind of latent bug that would survive unit tests and only surface in real-model accumulation. Calling it a correctness defect rather than a "be careful" footnote is the right framing.

A few notes on each section:

Empirical validation. The bitwise-equal-at-small / mantissa-floor-at-large pattern is exactly the equivalence story you can defend in a PR description without hedging.

fused_experts_impl vs modular kernel. I hadn't connected the intermediate_cache2 / workspace13 aliasing to mini-EP. Routing through fused_experts_impl is the right call, and ignore_invalid_experts=True already in moe_align_block_size makes the slot-map decomposition much cleaner than I'd sketched.

PCIe dominance. Matches what we saw on Nemotron-Cascade-2-30B-A3B and Gemma-4-26B-A4B. The LFRU validation reproduced the same early-layer-hot / deep-layer-starved pattern you described on GPT-OSS-20B, and the hit-rate gains were the dominant lever — bigger than any grouping cleverness we tried. Your 15x cold-swap measurement reinforces that "fit, or fail fast" is the only honest PR1 contract.

PR2 scope. Agree on holding it out. On the two open questions:

  1. Autotune cache invalidation across groups — I can run this on Nemotron-Cascade-2-30B-A3B and Gemma-4-26B-A4B-it (the two architectures already in this PR's validation set). They stress autotune differently — Nemotron's deep layer count surfaces shape-driven misses, Gemma's smaller per-layer expert count surfaces expert-id-driven ones.

  2. bf16 accumulation drift over 30-60 layers — same models, full forward sweep with running max-rel-err against a fp32 reference. If drift stays below 2e-2 after 100 forward passes, the bound holds. Above that, chunked-prefill becomes load-bearing.

Happy to do both on whatever branch you set up after PR1 merges. Ping me when there's a target commit.

